{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Word对象结构图\n",
    "\n",
    "<table width=\"100%\" border=\"0\">\n",
    "<tr>\n",
    "  <td valign=\"top\" width=\"50%\"  style=\"background-color:#ddffdd\">\n",
    "    <img src=\"images/Word对象结构图.png\" align=\"center\"/>\n",
    "  </td>\n",
    "  <td style=\"vertical-align:top; text-align:left;background-color:#ddffdd\">\n",
    "      \n",
    " ## 左图列出了常用的对象（实际上结构非常复杂）\n",
    "### 1. Document 对象表示整个文档\n",
    "### 2. Paragraph 对象表示段落（每一次回车会产生新段落）\n",
    "### 3. Run 对象表示相同样式的文本延续，称为文字块\n",
    "### 4. Table 对象表示文档中的表格\n",
    "<hr/>\n",
    "    \n",
    "* <font color=\"blue\" size=\"3px\"> \n",
    "     Document 对象包含多个 Paragraph 对象（列表），Paragraph 对象包含多个 Run 对象（列表）</font>\n",
    "\n",
    "\n",
    "* <font color=\"blue\" size=\"3px\">  Table 对象包含多个单元格 Cell 对象，Cell 对象又可以包含 Paragraph 对象的列表</font>\n",
    " \n",
    "\n",
    "\n",
    "  </td>\n",
    " </tr>\n",
    " <tr>\n",
    "  <td valign=\"top\" width=\"50%\" style=\"background-color:#ffdddd\">\n",
    "    <img src=\"images/word文档结构介绍.png\" align=\"center\"/> \n",
    "   </td>\n",
    "   <td style=\"vertical-align:top; text-align:left;background-color:#ffdddd\">\n",
    "      \n",
    " ## 重点讲解\n",
    "### 1、Document 文档\n",
    "### 2、Paragraph 段落（每一次回车会产生新段落）\n",
    "### 3、Run 文字块（相同样式的文本延续）\n",
    "\n",
    "  </td>\n",
    " </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Anaconda 默认不包含 docx工具包，需要执行以下脚本进行安装\n",
    "!pip install -i https://mirrors.aliyun.com/pypi/simple/ python-docx"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 导入word工具包\n",
    "import docx"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1、读取Word文件"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 文件路径\n",
    "docx_file = r\".\\files\\Word_读取说明.docx\"\n",
    "\n",
    "# 打开指定文件路径的 Word 文档\n",
    "document = docx.Document(docx_file)\n",
    "# 打印简单的信息看看\n",
    "print(\"有 %s 个段落\" % len(document.paragraphs))\n",
    "print(\"有 %s 个表格\" % len(document.tables))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2、查看段落 Paragraph 的内容"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 文档中的 paragraphs 段落对象列表\n",
    "for paragraph in document.paragraphs:\n",
    "    print(paragraph.text)\n",
    "# 段落的文本如下："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3.1、查看行内元素 Run 的内容"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 文档中的 paragraphs 段落对象列表\n",
    "for paragraph in document.paragraphs:\n",
    "    # 段落对象中的 runs 文本块对象列表\n",
    "    for r in paragraph.runs:\n",
    "        print(r.text)\n",
    "# 文字块：相同样式的文本延续\n",
    "# 文字块文本如下（未包含样式信息）："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3.2、语法知识"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1）+= 用法（字符串、整数）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "num = 1 # 整数\n",
    "for i in range(3):\n",
    "    num += 1  # 语法相当于 num = num + 1 ，就是每循环一次 num 就加 1\n",
    "    print(num)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s =\"开始|\" # 字符串\n",
    "for i in range(3):\n",
    "    value = \"当前数:%s|\" % i\n",
    "    s += value  # 语法相当于 s = s + value ，就是每循环一次 s 后面就追加 value 字符串\n",
    "    print(s)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2）截取字符串"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 语法：字符串 [ 开始索引 : 结束索引 ]\n",
    "print(\"1.       s =\",s) # 全部字符串\n",
    "print(\"2.   s[0:] =\",s[0:]) # 从 0（第一位） 开始取直到字符串结束，也是全部字符串\n",
    "print(\"3.   s[3:] =\",s[3:]) # 从 3 （第四位）开始取直到字符串结束\n",
    "print(\"4.    s[3] =\",s[3])  # 取第 3 （第四位）个字符\n",
    "print(\"5.   s[-2] =\",s[-2]) # 取倒数第 2 （倒数就还是第2）个字符\n",
    "print(\"6. s[0:-2] =\",s[0:-2]) # 从 0（第一位） 开始取直到字符串倒数第2结束"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3）字符串格式化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 定义变量\n",
    "num = 123\n",
    "text = \"计算结果\"\n",
    "# 字符串前面加“f”（format首字母）后，该字符串可以使用{变量}的值\n",
    "print(\"  加 f ->\", f\"{text}={num}\") # 加 f \n",
    "print(\"不加 f ->\",\"{text}={num}\") # 不加 f"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3.3、查看行内元素 Run 的内容和样式"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 文档中的 paragraphs 段落对象列表\n",
    "for paragraph in document.paragraphs:\n",
    "    # 段落对象中的 runs 文本块对象列表\n",
    "    for r in paragraph.runs:\n",
    "        # ** 注意：没有样式说明是默认的样式 **\n",
    "        r_info = \"\" # 存放一个文本块内容和样式信息，print 结果\n",
    "        if r.font.name is not None: # 如果字体名称不为空（None），显示字体名\n",
    "            r_info += r.font.name + \",\"\n",
    "        if r.font.size is not None: # 如果字号不为空，显示字号\n",
    "            r_info += \"字号:\" + str(round(int(r.font.size)/12700)) + \",\"\n",
    "        if r.bold: # 如果是粗体字\n",
    "            r_info += \"粗体字,\"\n",
    "        if r.italic: # 如果是斜体字\n",
    "            r_info += \"斜体字,\"\n",
    "        if r.font.color.rgb is not None: # 如果文字有颜色，显示字体颜色\n",
    "            r_info += \"颜色:\" + str(r.font.color.rgb) + \",\"\n",
    "        # 如果长度大于0，说明 r_info 有样式信息\n",
    "        if len(r_info) > 0:\n",
    "            # r_info 中最后一个字符肯定有一个“逗号”\n",
    "            r_info = r_info[0:-1] # 去掉末尾的“逗号”\n",
    "            r_info = r.text + f\"【{r_info}】\"\n",
    "        else: # 没有样式信息就直接显示文字\n",
    "            r_info = r.text\n",
    "        print(r_info)\n",
    "\n",
    "# 文字块：相同样式的文本延续\n",
    "# ** 注意：没有样式说明是默认的样式 **\n",
    "# 包含样式的文字块文本如下："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4、查看表格的内容"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 文档中的 tables 表格对象列表\n",
    "for table in document.tables:\n",
    "    # rows 为行对象列表\n",
    "    for row in table.rows:\n",
    "        # cells 为单元格对象列表\n",
    "        print(row.cells)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 文档中的 tables 表格对象列表\n",
    "for table in document.tables:\n",
    "    # 打印便于简单的展示表格\n",
    "    print(\"---------------------------------\")\n",
    "    for row in table.rows:\n",
    "        row_text = \"\"\n",
    "        for cell in row.cells:\n",
    "            # \\t 代表的意思是水平制表符，能否保证列尽可能的对齐\n",
    "            row_text += cell.text + \"\\t|\\t\"             \n",
    "        print(row_text[0:-3]) # [0:-3] 去掉最后的 \\t|\\t\n",
    "        # 打印便于简单的展示表格\n",
    "        print(\"---------------------------------\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5、查看图片"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def show_img(blob):\n",
    "    \"\"\"\n",
    "    显示图片的函数（以后的课程再讲解）\n",
    "    \"\"\"\n",
    "    from PIL import Image # 图片处理工具包\n",
    "    from io import BytesIO # 字节流工具包\n",
    "    import matplotlib.pyplot as plt # 图片显示\n",
    "    plt.figure('image',figsize=(16,12))\n",
    "    plt.xticks([])\n",
    "    plt.yticks([])\n",
    "    im = Image.open(BytesIO(blob))\n",
    "    plt.imshow(im)\n",
    "    plt.show()\n",
    "\n",
    "# 读取 Word 文档中的图片（工具包中没有直接提供图片对象列表）\n",
    "print(\"*************** 图片 **************\")\n",
    "dict_rel = document.part._rels\n",
    "for rel in dict_rel:\n",
    "    rel = dict_rel[rel]\n",
    "\n",
    "    if \"image\" in rel.reltype:\n",
    "        show_img(rel.target_part.blob)\n",
    "    else:\n",
    "        # word文档非常复杂，暂不扩展课程内容\n",
    "        # 我们把 word的扩展名改一下，您会发现确实复杂\n",
    "        pass # 这个语法就是 pass 通过，什么事情都不做，以后会经常见到"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 6、功能合并到函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import docx # 导入word工具包\n",
    "\n",
    "\n",
    "#读取docx中的文本代码示例\n",
    "def read_docx(docx_file):\n",
    "    # 打开 Word 文档\n",
    "    document = docx.Document(docx_file)\n",
    "    # 读取 Word 文档中的段落\n",
    "    # 文档中的 paragraphs 段落对象列表\n",
    "    for paragraph in document.paragraphs:\n",
    "        # 段落对象中的 runs 文本块对象列表\n",
    "        for r in paragraph.runs:\n",
    "            r_info = \"\" # 存放一个文本块内容和样式信息，print 结果\n",
    "            if r.font.name is not None: # 如果字体名称不为空，显示字体名\n",
    "                r_info += r.font.name + \",\"\n",
    "            if r.font.size is not None: # 如果字号不为空，显示字号\n",
    "                r_info += \"字号:\" + str(round(int(r.font.size)/12700)) + \",\"\n",
    "            if r.bold: # 如果是粗体字\n",
    "                r_info += \"粗体字,\"\n",
    "            if r.italic: # 如果是斜体字\n",
    "                r_info += \"斜体字,\"\n",
    "            if r.font.color.rgb is not None: # 如果文字有颜色，显示字体颜色\n",
    "                r_info += \"颜色:\" + str(r.font.color.rgb) + \",\"\n",
    "            # 如果长度大于0，说明 r_info 有样式信息\n",
    "            if len(r_info) > 0:\n",
    "                # r_info 中最后一个字符肯定有一个“逗号”\n",
    "                r_info = r_info[0:-1] # 去掉末尾的“逗号”\n",
    "                r_info = r.text + f\"【{r_info}】\"\n",
    "            else: # 没有样式信息就直接显示文字\n",
    "                r_info = r.text\n",
    "            print(r_info)\n",
    "            \n",
    "    # 文档中的 tables 表格对象列表\n",
    "    for table in document.tables:\n",
    "        # 打印便于简单的展示表格\n",
    "        print(\"---------------------------------\")\n",
    "        for row in table.rows:\n",
    "            row_text = \"\"\n",
    "            for cell in row.cells:\n",
    "                row_text += cell.text + \"\\t|\\t\" # \\t 代表的意思是水平制表符，能否保证列尽可能的对齐\n",
    "            print(row_text[0:-3]) # [0:-3] 去掉最后的 \\t|\\t\n",
    "            # 打印便于简单的展示表格\n",
    "            print(\"---------------------------------\")\n",
    "\n",
    "    def show_img(blob):\n",
    "        \"\"\"\n",
    "        显示图片的函数（以后的课程再讲解）\n",
    "        \"\"\"\n",
    "        from PIL import Image # 图片处理工具包\n",
    "        from io import BytesIO # 字节流工具包\n",
    "        import matplotlib.pyplot as plt # 图片显示\n",
    "        plt.figure('image',figsize=(16,12))\n",
    "        plt.xticks([])\n",
    "        plt.yticks([])\n",
    "        im = Image.open(BytesIO(blob))\n",
    "        plt.imshow(im)\n",
    "        plt.show()\n",
    "\n",
    "    # 读取 Word 文档中的图片（工具包中没有直接提供图片对象列表）\n",
    "    dict_rel = document.part._rels\n",
    "    for rel in dict_rel:\n",
    "        rel = dict_rel[rel]\n",
    "\n",
    "        if \"image\" in rel.reltype:\n",
    "            show_img(rel.target_part.blob)\n",
    "        else:\n",
    "            # word文档非常复杂，暂不扩展课程内容\n",
    "            # 我们把 word的扩展名改一下，您会发现确实复杂\n",
    "            pass # 这个语法就是 pass 通过，什么事情都不做，以后会经常见到  \n",
    "\n",
    "read_docx(r\".\\files\\Word_读取说明.docx\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Python总结"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**<font color=blue size=4> 1、掌握Word文档中常用对象 </font>**\n",
    "\n",
    "* <font color=blue size=4>Document 文档对象</font>\n",
    "\n",
    "* <font color=blue size=4>Paragraph 段落对象</font>\n",
    "\n",
    "* <font color=blue size=4>Run 文字块对象</font>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**<font color=blue size=4> 2、语法知识 </font>**\n",
    "\n",
    "* <font color=blue size=4> += 用法（字符串、整数）</font>\n",
    "\n",
    "* <font color=blue size=4>截取字符串</font>\n",
    "\n",
    "* <font color=blue size=4>字符串格式化</font>\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 思考题"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 运行下面的代码，并打开文件 **files\\Word结构.docx**，对比一下输出结果，理解一下 **Paragraph段落** 和 **Run文字块**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "read_docx(r\".\\files\\Word结构.docx\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 本次课程只涉及到Document 文档、Paragraph段落、Run文字块，Table表格对象，还有图片，Word文档结构还是非常复杂。\n",
    "\n",
    "* 大家可以把 docx 文件的扩展名改为 zip，并解压，看一下解压后的文件里面包含什么，这里有更多的**秘密**。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 参考文档\n",
    "\n",
    "中文：[https://www.osgeo.cn/python-docx/](https://www.osgeo.cn/python-docx/)\n",
    "\n",
    "英文：[https://python-docx.readthedocs.io/](https://python-docx.readthedocs.io/)\n",
    "\n",
    "* 未来如果要深入研究，可以参考这些文档"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
